A. The Provider Landscape

Choosing how to access LLMs

Agenda

  • A. Provider Landscape — Proprietary vs Open-Source vs Self-Hosted
  • B. API Anatomy — How HTTP calls to LLMs work with LiteLLM
  • C. Context Windows — Cost vs. Performance trade-offs
  • D. Security — Protecting your API keys
  • E. Wrap-up — Key takeaways

Model Access Decision Tree

graph TD
    A["Your Application"] --> B{"Model Access<br/>Strategy"}
    B --> C["Proprietary APIs"]
    B --> D["Aggregator APIs"]
    B --> E["Self-Hosted"]

    C --> C1["OpenAI / GPT-4"]
    C --> C2["Anthropic / Claude"]
    C --> C3["Google / Gemini"]

    D --> D1["OpenRouter"]
    D --> D2["Together AI"]
    D --> D3["Groq"]

    E --> E1["Ollama (Local)"]
    E --> E2["vLLM (Cloud GPU)"]
    E --> E3["AWS Bedrock"]

    style B fill:#FF7A5C,stroke:#1C355E,color:#fff
    style C fill:#1C355E,stroke:#1C355E,color:#fff
    style D fill:#00C9A7,stroke:#1C355E,color:#fff
    style E fill:#9B8EC0,stroke:#1C355E,color:#fff

Provider Comparison

| Aspect | Proprietary | Aggregators (OpenRouter) | Self-Hosted |
|---|---|---|---|
| Cost | $0.50–$60 / M tokens | Free tiers + wholesale prices | Hardware / GPU costs |
| Quality | State-of-the-art | Top proprietary + open-source | Same as open-source |
| Latency | Low (optimized infra) | Slight extra hop | Depends on hardware |
| Privacy | Data sent to provider | Routed via aggregator | Full data control |
| Rate Limits | Generous (paid) | Restrictive (free) | No limits (HW-bound) |
| Setup | Minutes | Minutes | Hours to days |

Why We Use LiteLLM + OpenRouter

For this course:

  1. Zero cost — Free models on OpenRouter
  2. Unified interface — One API for 100+ models
  3. Provider abstraction — Switch models by changing one string
  4. Production-ready — Built-in retries and fallbacks

In production you’ll mix providers:

  • Prototyping → OpenRouter free tier
  • Production quality → Direct OpenAI / Anthropic
  • Privacy-sensitive → Self-hosted (Ollama, vLLM)

Design with provider abstraction from day one.

B. API Anatomy

How LiteLLM routes your requests

The Request–Response Cycle

sequenceDiagram
    participant App as Your Application
    participant Lite as LiteLLM (Local)
    participant API as OpenRouter API
    participant Model as Hosted Model

    App->>Lite: completion(model="openrouter/...", messages=[...])
    Lite->>API: POST /chat/completions (OpenAI format)
    Note over Lite,API: Authorization: Bearer sk-or-v1-xxx
    API->>Model: Execute generation
    Model->>API: Return response
    API->>Lite: JSON response
    Lite->>App: Python object

Why LiteLLM?

LiteLLM translates your single call into the correct format for any provider — OpenAI, Anthropic, Groq, Ollama — without you changing your code.

Why messages?

From Session 1: LLMs are next-word predictors

A base LLM sees a stream of text and predicts what comes next — nothing more.

The problem: it needs context — who said what, and in what order — to predict the right next word.

The messages list is how we hand that context to the model:

| Role | What it represents | Completion analogy |
|---|---|---|
| system | Standing instructions / persona | The invisible prefix the author set |
| user | The human’s turn | The last thing said before the model continues |
| assistant | Previous model replies | What the model already wrote — it continues from here |

One mental model

Think of the entire messages list as a single document the model completes. Each role is just a labelled paragraph — the model’s job is to write the next assistant paragraph.
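The mental model above can be made concrete with a toy renderer that flattens a messages list into the single "document" the model is asked to continue. Real models use their own chat templates; the `[role]` labels here are purely illustrative.

```python
# A conversation as the API sees it: a list of role-tagged turns.
messages = [
    {"role": "system", "content": "You are a terse pirate."},
    {"role": "user", "content": "Where is the treasure?"},
    {"role": "assistant", "content": "Buried at dawn, matey."},
    {"role": "user", "content": "Dawn where?"},
]

def render(messages: list[dict]) -> str:
    """Flatten the turns into one document ending where the model must write."""
    doc = "\n".join(f"[{m['role']}]\n{m['content']}" for m in messages)
    return doc + "\n[assistant]\n"  # the model's job: the next assistant paragraph

print(render(messages))
```

Every request re-sends this whole document; the model has no memory beyond it.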

Request Structure

Install with uv pip install litellm python-dotenv, then:

import os
from dotenv import load_dotenv
from litellm import completion

load_dotenv()

response = completion(
    model="openrouter/meta-llama/llama-3-8b-instruct:free",
    messages=[
        {"role": "user", "content": "How many 'G's in 'huggingface'?"}
    ],
    max_tokens=100,
    temperature=0.7
)

print(response.choices[0].message.content)

Same format, any provider

Replace "openrouter/meta-llama/..." with "gpt-4o" or "claude-3-5-sonnet" — everything else stays the same.

Key API Parameters

| Parameter | What It Controls | Recommended Values |
|---|---|---|
| temperature | Randomness (0 = deterministic) | 0.1–0.3 factual, 0.7–0.9 creative |
| max_tokens | Response length limit | Set based on expected output |
| top_p | Nucleus sampling threshold | 0.9 (default) |
| messages | Conversation history | List of {"role": ..., "content": ...} |
| model | Provider/model to call | "openrouter/...", "gpt-4o", etc. |

C. The Context Window

Intelligence has a memory limit

Context Windows & Your API Bill

128K context ≠ 128K tokens free

Every token in the context window is billed on every request: the full history you send counts as input tokens each time, and the reply counts as output tokens.

KV-cache intuition: The API caches the mathematical representation of previous tokens so each new word generated is cheap. But the cache lives in GPU memory — and the context window is its size limit.

| Context length | Relative compute cost |
|---|---|
| 1K tokens | 1× |
| 8K tokens | 8× |
| 128K tokens | 128× |

Keep system prompts concise. Don’t send full conversation history when a summary suffices.
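A back-of-envelope budgeting sketch for the point above. The 4-characters-per-token heuristic and the per-token price are illustrative assumptions, not OpenRouter's rates; use a real tokenizer and your provider's pricing page for actual billing.

```python
# Rough per-request input cost for a growing conversation history.
PRICE_PER_M_INPUT = 0.50  # $ per million input tokens (hypothetical price)

def estimate_tokens(text: str) -> int:
    """Crude heuristic: ~4 characters per token. Fine for budgeting only."""
    return max(1, len(text) // 4)

def request_input_cost(messages: list[dict]) -> float:
    """Estimated input cost of sending this history once."""
    tokens = sum(estimate_tokens(m["content"]) for m in messages)
    return tokens * PRICE_PER_M_INPUT / 1_000_000

# 32 messages of ~1K tokens each ≈ 32K input tokens, billed again every turn.
history = [{"role": "user", "content": "x" * 4000}] * 32
print(f"${request_input_cost(history):.4f} per request")
```

Because the whole history is re-sent each turn, total spend grows roughly quadratically with conversation length unless you prune or summarize.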

Performance: The Needle in the Haystack

  • Lost in the Middle: LLMs often ignore facts buried in the middle of a long context.
  • Recall Degradation: As context grows, retrieval ability drops.
  • Placement Matters: Put crucial info at the start or end.

Multi-Needle & Reasoning Challenges

Retrieval is the Upper Bound

  1. Fact Volume: Retrieval accuracy drops significantly as the number of “needles” (facts) increases.
  2. Reasoning Penalty: Asking the model to think about multiple retrieved facts is harder than just finding them.
  3. Upper Bound: If the model can’t retrieve it, it can’t reason about it.

Optimization Strategies

  • Summarization: Don’t send the full history; send a “state of the union.”
  • Context Pruning: Use RAG to only send relevant snippets, not whole documents.
  • Strategic Placement: Put instructions and key constraints at the very end of the prompt, exploiting the model’s recency bias.
  • Check Your Bill: Use litellm’s usage tracking to monitor costs in real-time.
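The summarization and pruning strategies above can be sketched in a few lines. This version keeps the system prompt, replaces older turns with a placeholder summary, and keeps only the last few turns verbatim; in practice you would generate the summary text with a cheap model call rather than a stub.

```python
def prune_history(messages: list[dict], keep: int = 4) -> list[dict]:
    """Keep the system prompt + a summary of old turns + the last `keep` turns."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    if len(turns) <= keep:
        return system + turns
    dropped = len(turns) - keep
    summary = {
        "role": "system",
        "content": f"Summary of {dropped} earlier messages: <summary here>",
    }
    return system + [summary] + turns[-keep:]

history = [{"role": "system", "content": "Be concise."}] + \
          [{"role": "user", "content": f"msg {i}"} for i in range(10)]

print(len(prune_history(history)))  # system + summary + last 4 turns = 6
```

The pruned list stays in the same messages format, so it drops straight into `completion(...)` while cutting input tokens on every subsequent request.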

D. Security First

Protecting your API keys

Never Hardcode Secrets

Rule #1

Never put API tokens directly in source code. Not even “just for testing.”

# ❌ NEVER do this
response = completion(model="...", api_key="sk-or-v1-abc123realtoken")

# ✅ ALWAYS do this
import os
from dotenv import load_dotenv
load_dotenv()

# LiteLLM automatically reads OPENROUTER_API_KEY from environment
response = completion(model="openrouter/meta-llama/llama-3-8b-instruct:free",
                      messages=[{"role": "user", "content": "Hello"}])

The .env Approach (Development)

Step 1: Create a .env file in your project root

# .env
OPENROUTER_API_KEY=sk-or-v1-your_token_here

Step 2: Load it securely in Python

import os
from dotenv import load_dotenv

load_dotenv()

def validate_environment():
    """Ensure required API keys are present."""
    token = os.getenv("OPENROUTER_API_KEY")
    if not token:
        raise EnvironmentError(
            "OPENROUTER_API_KEY not found. "
            "Create a .env file or set the environment variable."
        )

Critical: .gitignore

Do This Immediately

Add .env to .gitignore before your first commit. One leaked token on GitHub = account compromise.

# .gitignore
.env
.env.local
*.env

Production secrets management:

| Environment | Tool |
|---|---|
| Development | .env + python-dotenv |
| CI/CD | GitHub Secrets, GitLab Variables |
| Production | AWS Secrets Manager, HashiCorp Vault, Azure Key Vault |

System Environment Variables (CI/CD)

# Linux / macOS
export OPENROUTER_API_KEY=sk-or-v1-your_token_here

# Windows PowerShell
$env:OPENROUTER_API_KEY = "sk-or-v1-your_token_here"

# Windows CMD
set OPENROUTER_API_KEY=sk-or-v1-your_token_here

Why Not Just .env Everywhere?

Environment variables can leak through process listings and crash dumps. In production, use secrets managers that provide encryption, rotation, and audit logging.
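One way to keep code identical across these environments is a layered lookup: environment variables in development and CI, a secrets manager in production. The `fetch_from_vault` hook below is a placeholder of our own; in real code you would wire it to boto3, hvac, or your platform's SDK.

```python
import os

def get_secret(name: str, fetch_from_vault=None) -> str:
    """Look up a secret: environment first, then an optional vault backend."""
    value = os.environ.get(name)
    if value:
        return value
    if fetch_from_vault is not None:
        return fetch_from_vault(name)  # e.g. a Secrets Manager / Vault call
    raise EnvironmentError(
        f"Secret {name!r} not found. Set the env var or configure a vault backend."
    )

# Development: the value comes from .env / the shell environment.
os.environ["OPENROUTER_API_KEY"] = "sk-or-v1-demo"
print(get_secret("OPENROUTER_API_KEY"))
```

Application code calls `get_secret` everywhere; only the deployment decides which backend answers.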

E. Wrap-up

Key Takeaways

  1. Three access strategies: proprietary APIs, aggregators (like OpenRouter), self-hosted — choose based on cost, quality, latency, and privacy.
  2. LiteLLM gives you a single, unified interface to 100+ model providers.
  3. Context Windows are expensive and performance-bound — be strategic about what you include.
  4. OpenRouter provides free access to strong models — perfect for prototyping.
  5. Security is non-negotiable — use .env and secrets managers.
  6. Design for provider abstraction — make switching providers trivial.

Up Next

Lab 2: Build a production-grade LiteLLM client — from “hello world” to retry logic and caching.